INTERSPEECH 2023 - Speech Processing

Total: 118

#1 CQNV: A Combination of Coarsely Quantized Bitstream and Neural Vocoder for Low Rate Speech Coding

Authors: Youqiang Zheng ; Li Xiao ; Weiping Tu ; Yuhong Yang ; Xinmeng Xu

Recently, speech codecs based on neural networks have been shown to outperform traditional methods. However, codec architectures that combine a traditional codec with a neural vocoder still carry redundancy from traditional parameter quantization. In this paper, we propose a novel framework named CQNV, which combines the coarsely quantized parameters of a traditional parametric codec, to reduce the bitrate, with a neural vocoder, to improve the quality of the decoded speech. Furthermore, we introduce a parameter processing module into the neural vocoder to better exploit the bitstream of traditional speech coding parameters, further improving the quality of the reconstructed speech. In the experiments, both subjective and objective evaluations demonstrate the effectiveness of the proposed CQNV framework. Specifically, our method achieves higher-quality reconstructed speech at 1.1 kbps than Lyra and Encodec at 3 kbps.

#2 Target Speech Extraction with Conditional Diffusion Model

Authors: Naoyuki Kamo ; Marc Delcroix ; Tomohiro Nakatani

Diffusion model-based speech enhancement has received increased attention since it can generate very natural enhanced signals and generalizes well to unseen conditions. Diffusion models have been explored for several sub-tasks of speech enhancement, such as speech denoising, dereverberation, and source separation. In this paper, we investigate their use for target speech extraction (TSE), which consists of estimating the clean speech signal of a target speaker from a multi-talker mixture. TSE is realized by conditioning the extraction process on a clue identifying the target speaker. We show that we can realize TSE using a conditional diffusion model conditioned on the clue. In addition, we introduce ensemble inference to reduce potential extraction errors caused by the diffusion process. In experiments on the Libri2mix corpus, we show that the proposed diffusion model-based TSE combined with ensemble inference outperforms a comparable TSE system trained discriminatively.
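
A minimal sketch of the ensemble-inference idea, under the assumption that the ensemble simply averages several stochastic reverse-diffusion runs started from different random seeds; the paper's actual sampler and fusion rule may differ, and `reverse_diffusion` below is a hypothetical placeholder rather than the authors' model.

```python
import torch

def reverse_diffusion(mixture, clue_emb, seed):
    """Placeholder for one conditional reverse-diffusion run.

    A real sampler would iterate over diffusion steps, conditioning the
    score/denoiser network on the target-speaker clue embedding
    (the placeholder ignores the clue and only perturbs the mixture).
    """
    torch.manual_seed(seed)
    noise = torch.randn_like(mixture)
    return 0.5 * mixture + 0.01 * noise

def ensemble_tse(mixture, clue_emb, num_runs=5):
    """Average several stochastic extraction runs to suppress outlier errors."""
    estimates = [reverse_diffusion(mixture, clue_emb, seed=s) for s in range(num_runs)]
    return torch.stack(estimates).mean(dim=0)

mixture = torch.randn(1, 16000)   # 1 s of 16 kHz mixture audio
clue_emb = torch.randn(1, 256)    # enrollment-derived speaker clue
extracted = ensemble_tse(mixture, clue_emb)
print(extracted.shape)            # torch.Size([1, 16000])
```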

#3 Towards Fully Quantized Neural Networks For Speech Enhancement

Authors: Elad Cohen ; Hai Victor Habi ; Arnon Netzer

Deep learning models have shown state-of-the-art results in speech enhancement. However, deploying such models on an eight-bit integer-only device is challenging. In this work, we analyze the gaps in deploying a vanilla quantization-aware training method for speech enhancement, revealing two significant observations. First, quantization mainly affects signals with a high input Signal-to-Noise Ratio (SNR). Second, quantizing the model's input and output causes major performance degradation. Based on our analysis, we propose Fully Quantized Speech Enhancement (FQSE), a new quantization-aware training method that closes these gaps and enables eight-bit integer-only quantization. FQSE introduces data augmentation to mitigate the quantization effect on high-SNR signals. Additionally, we add an input splitter and a residual quantization block to the model to overcome the error of input-output quantization. We show that FQSE closes the performance gaps induced by eight-bit quantization.
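
For context, a minimal sketch of the simulated ("fake") 8-bit quantization node used in vanilla quantization-aware training, which is the setting the analysis above starts from; the paper's input splitter and residual quantization block are not reproduced here, and the tensor shapes are illustrative.

```python
import torch

def fake_quantize(x, num_bits=8):
    """Simulate int8 quantization in the forward pass (straight-through estimator)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    x_dq = (q - zero_point) * scale
    # Straight-through estimator: the forward pass uses the quantized values,
    # while the backward pass lets gradients flow through unchanged.
    return x + (x_dq - x).detach()

x = torch.randn(4, 257, 100, requires_grad=True)   # e.g. a batch of spectrogram features
y = fake_quantize(x)
y.sum().backward()
print(x.grad.abs().sum() > 0)   # gradients flow despite the rounding
```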

#4 Complex Image Generation SwinTransformer Network for Audio Denoising

Authors: Youshan Zhang ; Jialu Li

Achieving high-performance audio denoising remains a challenging task in real-world applications. Existing time-frequency methods often ignore the quality of the generated frequency-domain images. This paper converts the audio denoising problem into an image generation task. We first develop a complex image generation SwinTransformer network to capture more information from the complex Fourier domain. We then impose structural similarity and detail loss functions to generate high-quality images, and develop an SDR loss to minimize the difference between the denoised and clean audio. Extensive experiments on two benchmark datasets demonstrate that our proposed model outperforms state-of-the-art methods.
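
A minimal sketch of an SDR-style loss of the kind described above, assuming the common negative-SDR-in-dB formulation; the paper's exact definition and weighting may differ.

```python
import torch

def sdr_loss(estimate, target, eps=1e-8):
    """Negative signal-to-distortion ratio (in dB), averaged over the batch.

    Minimizing this loss drives the denoised waveform toward the clean target.
    """
    signal_power = torch.sum(target ** 2, dim=-1)
    distortion_power = torch.sum((target - estimate) ** 2, dim=-1)
    sdr = 10.0 * torch.log10((signal_power + eps) / (distortion_power + eps))
    return -sdr.mean()

clean = torch.randn(8, 16000)
denoised = clean + 0.1 * torch.randn(8, 16000)
print(sdr_loss(denoised, clean))   # roughly -20 dB at this noise level
```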

#5 Biophysically-inspired single-channel speech enhancement in the time domain

Authors: Chuan Wen ; Sarah Verhulst

Most state-of-the-art speech enhancement (SE) methods utilize time-frequency (T-F) features or waveforms as input features and generalize poorly at negative signal-to-noise ratios (SNRs). To overcome these issues, we propose a novel network that integrates biophysical properties of the human auditory system, which is known to perform even at negative SNRs. We generated biophysical features using CoNNear, a neural-network auditory model, and fed them into AECNN, a state-of-the-art speech enhancement model. The model was trained on the INTERSPEECH 2021 DNS Challenge dataset and evaluated on mismatched noise conditions at various SNRs. The experimental results revealed that the bio-inspired approaches outperformed T-F and waveform features at positive SNRs and demonstrated stronger robustness to unseen noise at negative SNRs. We conclude that incorporating human-like features can extend the operating range of SE systems to more negative SNRs.

#6 On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer

Authors: Md Asif Jalal ; Pablo Peso Parada ; Jisi Zhang ; Mete Ozay ; Karthikeyan Saravanan ; Myoungji Han ; Jung In Lee ; Seokyeong Jung

Smart devices serviced by large-scale AI models necessitate transferring user data to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving accuracy on our downstream task, Automatic Speech Recognition (ASR). The proposed framework attaches flexible gradient reversal-based speaker adversarial layers to target layers within an ASR model, where speaker adversarial training anonymizes the acoustic embeddings generated by the targeted layers to remove speaker identity. We propose on-device deployment by executing the initial layers of the ASR model on the device and transmitting the anonymized embeddings to the cloud, where the rest of the model is executed while preserving privacy. The results show that our method effectively reduces speaker recognition accuracy by 33% relative, and improves ASR performance by achieving a 6.2% relative Word Error Rate (WER) reduction.
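
A minimal sketch of the gradient reversal primitive on which such speaker adversarial layers are built; the flexible placement within the ASR model and the overall training recipe are specific to the paper and not shown, and the classifier and embedding sizes are illustrative.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales and flips gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Toy adversarial branch: acoustic embeddings from some ASR layer are pushed
# to become uninformative about speaker identity.
speaker_classifier = nn.Linear(256, 100)          # 100 training speakers (illustrative)
acoustic_emb = torch.randn(4, 256, requires_grad=True)
speaker_logits = speaker_classifier(grad_reverse(acoustic_emb, lambd=0.5))
loss = nn.functional.cross_entropy(speaker_logits, torch.randint(0, 100, (4,)))
loss.backward()   # gradients reaching acoustic_emb are reversed (anonymizing direction)
```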

#7 How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

Authors: Hye-jin Shim ; Rosa Gonzalez Hautamäki ; Md Sahidullah ; Tomi Kinnunen

Shortcut learning, or the 'Clever Hans effect', refers to situations where a learning agent (e.g., a deep neural network) learns spurious correlations present in the data, resulting in biased models. We focus on finding shortcuts in deep learning-based spoofing countermeasures (CMs) that predict whether a given utterance is spoofed or not. While prior work has addressed specific data artifacts, such as silence, no general normative framework has been explored for analyzing shortcut learning in CMs. In this study, we propose a generic approach to identifying shortcuts by introducing systematic interventions on the training and test sides, including the boundary cases of 'near-perfect' and 'worse than coin flip' (label flip) performance. Using three different models, ranging from classic to state-of-the-art, we demonstrate the presence of shortcut learning in five simulated conditions. We also analyze the results using a regression model to understand how biases affect the class-conditional score statistics.

#8 CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram

Authors: Zhifeng Kong ; Wei Ping ; Ambrish Dantrey ; Bryan Catanzaro

In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of a waveform denoiser and a spectrogram denoiser, achieving the best of both worlds. CleanUNet 2 uses a two-stage framework, inspired by popular speech synthesis methods, that consists of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform denoiser, and further boosts its performance by taking predicted spectrograms from a spectrogram denoiser as input. We demonstrate that CleanUNet 2 outperforms previous methods in terms of various objective and subjective evaluations.

#9 A Two-stage Progressive Neural Network for Acoustic Echo Cancellation

Authors: Zhuangqi Chen ; Xianjun Xia ; Cheng Chen ; Xianke Wang ; Yanhong Leng ; Li Chen ; Roberto Togneri ; Yijian Xiao ; Piao Ding ; Shenyi Song ; Pingjian Zhang

Recent studies in deep learning-based acoustic echo cancellation prove the benefits of introducing a linear echo cancellation module. However, the convergence problem and potential target-speech distortion impose an additional learning burden on the neural network. In this paper, we propose a two-stage progressive neural network consisting of a coarse-stage and a fine-stage module. In the coarse stage, a lightweight network module is designed to suppress partial echo and potential noise, where a voice activity detection path is used to enhance the learned features. In the fine stage, a larger network is employed to deal with the more complex echo path and restore the near-end speech. We have conducted extensive experiments to verify the proposed method, and the results show that the proposed two-stage method provides superior performance to other state-of-the-art methods.

#10 An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec

Authors: Linping Xu ; Jiawei Jiang ; Dejun Zhang ; Xianjun Xia ; Li Chen ; Yijian Xiao ; Piao Ding ; Shenyi Song ; Sixing Yin ; Ferdous Sohel

Recently, neural networks have proven to be effective for speech coding at low bitrates. However, underutilization of intra-frame correlations and quantizer error degrade the reconstructed audio quality. To improve the coding quality, we present an end-to-end neural speech codec, namely CBRC (Convolutional and Bidirectional Recurrent neural Codec). An interleaved structure of 1D-CNN and Intra-BRNN blocks is designed to exploit intra-frame correlations more efficiently. Furthermore, a Group-wise and Beam-search Residual Vector Quantizer (GB-RVQ) is used to reduce the quantization noise. CBRC encodes audio every 20 ms with no additional latency, which is suitable for real-time communication. Experimental results demonstrate the superiority of the proposed codec when comparing CBRC at 3 kbps with Opus at 12 kbps.
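
For illustration, a plain greedy residual vector quantizer; the paper's group-wise and beam-search extensions (GB-RVQ) are not reproduced, and the latent dimension and codebook sizes below are arbitrary.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """Greedy multi-stage RVQ: each stage quantizes the residual left by the previous one.

    x:          (batch, dim) latent frames
    codebooks:  list of (codebook_size, dim) tensors, one per stage
    """
    residual = x
    quantized = torch.zeros_like(x)
    indices = []
    for codebook in codebooks:
        # Nearest codeword for the current residual (Euclidean distance).
        dists = torch.cdist(residual, codebook)          # (batch, codebook_size)
        idx = dists.argmin(dim=-1)                       # (batch,)
        chosen = codebook[idx]                           # (batch, dim)
        quantized = quantized + chosen
        residual = residual - chosen
        indices.append(idx)
    return quantized, indices

latents = torch.randn(16, 64)
codebooks = [torch.randn(1024, 64) for _ in range(4)]    # 4 stages of 10-bit codebooks
q, idx = residual_vector_quantize(latents, codebooks)
print(q.shape, len(idx))                                 # torch.Size([16, 64]) 4
```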

#11 Real-Time Personalised Speech Enhancement Transformers with Dynamic Cross-attended Speaker Representations

Authors: Shucong Zhang ; Malcolm Chadwick ; Alberto Gil C. P. Ramos ; Titouan Parcollet ; Rogier van Dalen ; Sourav Bhattacharya

Personalised speech enhancement (PSE) extracts only the speech of a target user and removes everything else from corrupted input audio. This can greatly improve on-device streaming audio processing, such as voice calls and speech recognition, which has strict requirements on model size and latency. To focus the PSE system on the target speaker, it is conditioned on a recording of the user's voice. This recording is usually summarised as a single static vector. However, a static vector cannot reflect all the target user's voice characteristics. Thus, we propose using the full recording. To condition on such a variable-length sequence, we propose fully Transformer-based PSE models with a cross-attention mechanism which generates target speaker representations dynamically. To better reflect the on-device scenario, we carefully design and publish a new PSE dataset. On the dataset, our proposed model significantly surpasses strong baselines while halving the model size and reducing latency.
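
A minimal sketch of the cross-attention conditioning idea, assuming mixture-frame features act as queries over the full variable-length enrollment sequence; the layer sizes and surrounding Transformer architecture are illustrative, not the authors' exact model.

```python
import torch
from torch import nn

class DynamicSpeakerConditioning(nn.Module):
    """Mixture frames query the full enrollment sequence instead of a single static vector."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mixture_feats, enrollment_feats):
        # mixture_feats:    (batch, T_mix, dim)
        # enrollment_feats: (batch, T_enroll, dim) -- variable-length enrollment recording
        spk_repr, _ = self.cross_attn(query=mixture_feats,
                                      key=enrollment_feats,
                                      value=enrollment_feats)
        return self.norm(mixture_feats + spk_repr)

layer = DynamicSpeakerConditioning()
mix = torch.randn(2, 300, 256)
enroll = torch.randn(2, 500, 256)
print(layer(mix, enroll).shape)   # torch.Size([2, 300, 256])
```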

#12 CFTNet: Complex-valued Frequency Transformation Network for Speech Enhancement

Authors: Nursadul Mamun ; John H. L. Hansen

It is widely known that the presence of multi-speaker babble noise greatly degrades speech intelligibility. However, suppressing noise without creating artifacts in human speech is challenging in environments with a low signal-to-noise ratio (SNR), and even more so if the noise is speech-like, such as babble noise. Deep learning-based systems either enhance the magnitude response and reuse the distorted phase, or enhance the complex spectrogram. The frequency transformation block (FTB) has emerged as a useful architecture for implicitly capturing harmonic correlation, which is especially important for people with hearing loss (hearing aid/cochlear implant users). This study proposes a complex-valued frequency transformation network (CFTNet) for speech enhancement, which leverages both a complex-valued U-Net and FTBs to capture sufficient low-level contextual information. The proposed system learns a complex transformation matrix to accurately recover speech in the time-frequency domain from a noisy spectrogram. Experimental results demonstrate that the proposed system achieves significant improvements in both seen and unseen noise over state-of-the-art networks. Furthermore, the proposed CFTNet can suppress highly non-stationary noise without creating the musical artifacts commonly observed with conventional enhancement methods.

#13 Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement

Authors: Hejung Yang ; Hong-Goo Kang

Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning because they are able to extract high-quality salient features from input data. As such, they have been frequently used as base networks for various pattern classification tasks such as speech recognition. However, not much research has been conducted on applying these types of models to the field of speech signal generation. In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task. To alleviate mismatches between the input features of the pre-trained model and the target enhancement model, we adopt a novel feature normalization technique to smoothly link these modules together. Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.

#14 Multi-mode Neural Speech Coding Based on Deep Generative Networks

Authors: Wei Xiao ; Wenzhe Liu ; Meng Wang ; Shan Yang ; Yupeng Shi ; Yuyong Kang ; Dan Su ; Shidong Shang ; Dong Yu

Wideband and super-wideband speech, with its higher-resolution spectrum, is one of the most prominent features of real-time communication services; however, it comes at a higher computational cost. In this paper, we introduce the Penguins codec, based on a multi-mode neural speech coding structure that combines sub-band speech processing and applies different strategies from the low band to the high band. In particular, it uses deep generative networks with perceptual-constraint loss functions and knowledge distillation to reconstruct the wideband components, and bandwidth extension to generate artificial super-wideband components. The method yields high-quality speech at very low bitrates. Several subjective and objective experiments, including ablation studies, were conducted, and the results prove the merit of the proposed scheme compared with traditional coding schemes and state-of-the-art neural coding methods.

#15 Streaming Dual-Path Transformer for Speech Enhancement

Authors: Soo Hyun Bae ; Seok Wan Chae ; Youngseok Kim ; Keunsang Lee ; Hyunjin Lim ; Lae-Hoon Kim

Speech enhancement employing a dual-path transformer (DPT) with a dilated DenseNet-based encoder and decoder has shown state-of-the-art performance. By applying attention along both the time and frequency paths, the DPT learns the long-term dependency of speech and the relationship between frequency components. However, the batch processing of the DPT, which performs attention over all past and future frames, makes it impractical for real-time applications. To satisfy the real-time requirement, we propose a streaming dual-path transformer (stDPT) with a zero-look-ahead structure. In the training phase, we apply masking techniques to control the context length, and in the inference phase, caching methods are utilized to preserve sequential information. Extensive experiments have been conducted to show the performance for different context lengths, and the results verify that the proposed method outperforms current state-of-the-art speech enhancement models based on real-time processing.
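
A minimal sketch of a training-time attention mask that limits each frame to a finite left context with zero look-ahead, which is one way to realize the masking described above; the stDPT's actual masking and inference-time caching scheme may differ.

```python
import torch

def limited_context_mask(num_frames, context):
    """Boolean attention mask: True marks positions a query frame may NOT attend to.

    Each frame sees only itself and the previous `context - 1` frames (no look-ahead).
    """
    idx = torch.arange(num_frames)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)     # rel[i, j] = j - i
    return (rel > 0) | (rel < -(context - 1))

mask = limited_context_mask(num_frames=6, context=3)
print(mask.int())
# Row i: frame i may attend only to frames i-2 .. i (zero look-ahead).
# The mask can be passed to torch.nn.MultiheadAttention via `attn_mask`.
```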

#16 Sequence-to-Sequence Multi-Modal Speech In-Painting

Authors: Mahsa Kadkhodaei Elyaderani ; Shahram Shirani

Speech in-painting is the task of regenerating missing audio content using reliable context information. Despite various recent studies on multi-modal audio in-painting, there is still a need for an effective fusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip reader for the facial recordings, and the decoder takes both the encoder outputs and the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and achieves results comparable to a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which demonstrates the effectiveness of the introduced multi-modality in speech in-painting.

#17 Hybrid AHS: A Hybrid of Kalman Filter and Deep Learning for Acoustic Howling Suppression

Authors: Hao Zhang ; Meng Yu ; Yuzhong Wu ; Tao Yu ; Dong Yu

Deep learning has recently been introduced for efficient acoustic howling suppression (AHS). However, the recurrent nature of howling creates a mismatch between offline training and streaming inference, limiting the quality of the enhanced speech. To address this limitation, we propose a hybrid method that combines a Kalman filter with a self-attentive recurrent neural network (SARNN) to leverage their respective advantages for robust AHS. During offline training, a pre-processed signal obtained from the Kalman filter and an ideal microphone signal generated via a teacher-forcing training strategy are used to train the deep neural network (DNN). During streaming inference, the DNN's parameters are fixed while its output serves as a reference signal for updating the Kalman filter. Evaluations in both offline and streaming inference scenarios using simulated and real-recorded data show that the proposed method efficiently suppresses howling and consistently outperforms baselines.

#18 Differentially Private Adapters for Parameter Efficient Acoustic Modeling

Authors: Chun-Wei Ho ; Chao-Han Huck Yang ; Sabato Marco Siniscalchi

In this work, we devise a parameter-efficient solution to bring differential privacy (DP) guarantees into the adaptation of a cross-lingual speech classifier. We investigate a new frozen pre-trained adaptation framework for DP-preserving speech modeling without full model fine-tuning. First, we introduce a noisy teacher-student ensemble into a conventional adaptation scheme leveraging a frozen pre-trained acoustic model, and attain performance superior to DP-based stochastic gradient descent (DPSGD). Next, we insert residual adapters (RAs) between layers of the frozen pre-trained acoustic model. The RAs reduce training cost and time significantly with a negligible performance drop. Evaluated on the open-access Multilingual Spoken Words (MLSW) dataset, our solution reduces the number of trainable parameters by 97.5% using the RAs, with only a 4% performance drop relative to fine-tuning the cross-lingual speech classifier, while preserving DP guarantees.
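
A minimal sketch of a bottleneck residual adapter of the kind inserted between frozen layers; the dimensions are illustrative, and the DP machinery (noisy teacher-student ensemble, gradient clipping and noising) is not shown.

```python
import torch
from torch import nn

class ResidualAdapter(nn.Module):
    """Small bottleneck module inserted between frozen pre-trained layers.

    Only the adapter's parameters are trained, keeping the trainable
    parameter count (and hence the DP-related training cost) small.
    """

    def __init__(self, dim=768, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

frozen_layer = nn.Linear(768, 768)
for p in frozen_layer.parameters():
    p.requires_grad = False               # backbone stays frozen

adapter = ResidualAdapter()
x = torch.randn(4, 100, 768)
out = adapter(frozen_layer(x))            # adapter output feeds the next frozen layer
trainable = sum(p.numel() for p in adapter.parameters())
print(trainable)                          # ~50k parameters vs. ~590k in the frozen layer
```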

#19 Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement through Knowledge Distillation

Authors: Rui-Chen Zheng ; Yang Ai ; Zhen-Hua Ling

Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech using extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve the performance of lip-based AV-SE systems. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to traditional audio-lip speech enhancement baselines. Further analysis using the phone error rate (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
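
A minimal sketch of a generic distillation objective in which the audio-lip student matches both the output and an intermediate representation of the frozen audio-lip-tongue teacher; the paper's actual distillation targets, loss types, and weights are assumptions here.

```python
import torch
from torch import nn

def distillation_loss(student_out, teacher_out, student_feat, teacher_feat,
                      clean_target, alpha=0.5, beta=0.1):
    """Combine the usual enhancement loss with teacher-matching terms.

    student_out / teacher_out:   enhanced spectrograms (or waveforms)
    student_feat / teacher_feat: intermediate representations to align
    clean_target:                ground-truth clean speech
    """
    enhancement = nn.functional.l1_loss(student_out, clean_target)
    output_kd = nn.functional.l1_loss(student_out, teacher_out.detach())
    feature_kd = nn.functional.mse_loss(student_feat, teacher_feat.detach())
    return enhancement + alpha * output_kd + beta * feature_kd

# Shapes are illustrative: (batch, frequency bins, frames).
s_out = torch.randn(2, 257, 100)
t_out = torch.randn(2, 257, 100)
clean = torch.randn(2, 257, 100)
s_feat, t_feat = torch.randn(2, 256, 100), torch.randn(2, 256, 100)
print(distillation_loss(s_out, t_out, s_feat, t_feat, clean))
```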

#20 Consonant-emphasis Method Incorporating Robust Consonant-section Detection to Improve Intelligibility of Bone-conducted speech

Authors: Yasufumi Uezu ; Sicheng Wang ; Teruki Toya ; Masashi Unoki

A consonant-emphasis (CE) method was previously proposed to improve the word intelligibility of speech presented over bone-conducted (BC) headphones. However, the consonant-section detection (CSD) performance of this method is not robust for certain consonants. Therefore, a CE method with robust CSD is necessary for presented BC speech. We focus on improving the word intelligibility of presented BC speech in noisy environments and propose a CE method with robust CSD that combines the detection processes for voiced and unvoiced consonant sections. An evaluation of CSD procedures showed that the proposed, more robust CSD procedure outperformed that of the conventional CE method as well as voiced-only and unvoiced-only CSD. Word-intelligibility tests were also conducted on presented BC speech in noisy environments to compare the proposed and conventional methods, and the proposed method significantly improved word intelligibility over the conventional methods at a noise level of 75 dB.

#21 Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

Authors: Hiroshi Sato ; Ryo Masumura ; Tsubasa Ochiai ; Marc Delcroix ; Takafumi Moriya ; Takanori Ashihara ; Kentaro Shinayama ; Saki Mizuno ; Mana Ihori ; Tomohiro Tanaka ; Nobukatsu Hojo

Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks, as it leverages massive amounts of unlabeled audio data. The noise robustness of SSL models is one of the important challenges to expanding their application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits its effect. In this work, we propose a new SE training criterion that minimizes the distance between clean and enhanced signals in the feature representation of the SSL model to alleviate the mismatch. We expect that a loss in the SSL domain can guide SE training to preserve or enhance the various levels of speech-signal characteristics that may be required for high-level downstream tasks. Experiments show that our proposal improves the performance of an SE and SSL pipeline on five downstream tasks with noisy input while maintaining SE performance.
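
A minimal sketch of the proposed training criterion, assuming an L1 distance between enhanced and clean signals in the feature space of a frozen SSL encoder; `ssl_encoder` below is a crude stand-in module, not the actual SSL model used in the paper.

```python
import torch
from torch import nn

# Stand-in for a frozen self-supervised speech encoder (e.g. a wav2vec 2.0-style model).
ssl_encoder = nn.Sequential(
    nn.Conv1d(1, 256, kernel_size=400, stride=320),   # crude frame-level feature extractor
    nn.ReLU(),
)
for p in ssl_encoder.parameters():
    p.requires_grad = False

def ssl_feature_loss(enhanced_wav, clean_wav):
    """Distance between enhanced and clean signals in the frozen SSL feature space."""
    f_enh = ssl_encoder(enhanced_wav.unsqueeze(1))
    f_cln = ssl_encoder(clean_wav.unsqueeze(1)).detach()
    return nn.functional.l1_loss(f_enh, f_cln)

clean = torch.randn(4, 16000)
enhanced = clean + 0.05 * torch.randn(4, 16000)
# In practice this would be combined with a signal-domain SE loss,
# e.g. total = signal_loss + weight * ssl_feature_loss(enhanced, clean).
print(ssl_feature_loss(enhanced, clean))
```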

#22 Perceptual Improvement of Deep Neural Network (DNN) Speech Coder Using Parametric and Non-parametric Density Models

Authors: Joon Byun ; Seungmin Shin ; Jongmo Sung ; Seungkwon Beack ; Youngcheol Park

This paper proposes a method to improve the perceptual quality of an end-to-end neural speech coder using density models for the bottleneck samples. Two approaches, one parametric and one non-parametric, are explored for modeling the bottleneck sample density. The first approach utilizes a sub-network to generate mean-scale hyperpriors for the bottleneck samples, while the second models the bottleneck samples using a separate sub-network without any side information. The whole network, including the sub-network, is trained using PAM-based perceptual losses at different timescales to shape the quantization noise below the masking threshold. The proposed method achieves a frame-dependent entropy model that enhances arithmetic coding efficiency while emphasizing perceptually relevant audio cues. Experimental results show that the proposed density model combined with PAM-based losses improves perceptual quality compared to conventional speech coders in both objective and subjective tests.
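
For illustration, the rate term that a mean-scale Gaussian density model yields for quantized bottleneck samples (the box-integrated likelihood commonly used with hyperpriors); the hyperprior sub-network, the PAM-based perceptual losses, and the arithmetic coder itself are not shown, and the sample statistics below are arbitrary.

```python
import torch

def estimated_bits(bottleneck, mean, scale, eps=1e-9):
    """Estimated code length (in bits) of quantized bottleneck samples under a
    mean-scale Gaussian model, using the probability mass of unit-width bins."""
    scale = scale.clamp(min=1e-6)
    dist = torch.distributions.Normal(mean, scale)
    upper = dist.cdf(bottleneck + 0.5)
    lower = dist.cdf(bottleneck - 0.5)
    bits = -torch.log2((upper - lower).clamp(min=eps))
    return bits.sum(dim=-1).mean()     # average bits per frame over the batch

z = torch.round(torch.randn(8, 64) * 4)            # quantized bottleneck samples
mean, scale = torch.zeros(8, 64), 4 * torch.ones(8, 64)
print(estimated_bits(z, mean, scale))              # rate term added to the perceptual loss
```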

#23 DeFT-AN RT: Real-time Multichannel Speech Enhancement using Dense Frequency-Time Attentive Network and Non-overlapping Synthesis Window

Authors: Dongheon Lee ; Dayun Choi ; Jung-Woo Choi

In real-time speech enhancement models based on the short-time Fourier transform (STFT), the algorithmic latency induced by the STFT window size can cause perceptible delays, leading to reduced immersion in real-time applications. This study proposes an efficient real-time enhancement model based on the dense frequency-time attentive network (DeFT-AN). The vanilla DeFT-AN consists of cascaded dense blocks and time-frequency transformers, which allow for a smooth transition between time frames through a temporal attention mechanism. To inherit this advantage while reducing algorithmic latency, we develop a lightweight and causal version of DeFT-AN with dual-window-size processing that utilizes synthesis windows shorter than the analysis windows. DeFT-AN's strength in capturing temporal context enables the use of non-overlapping synthesis windows, and experimental results show that the model achieves the highest performance with the lowest algorithmic latency among STFT-based models.

#24 Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

Authors: Sefik Emre Eskimez ; Takuya Yoshioka ; Alex Ju ; Min Tang ; Tanel Pärnamaa ; Huaming Wang

Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices. To deploy a PSE model for full duplex communications, the model must be combined with acoustic echo cancellation (AEC), although such a combination has been less explored. This paper proposes a series of methods that are applicable to various model architectures to develop efficient causal models that can handle the tasks of PSE, AEC, and joint PSE-AEC. We present extensive evaluation results using both simulated data and real recordings, covering various acoustic conditions and evaluation metrics. The results show the effectiveness of the proposed methods for two different model architectures. Our best joint PSE-AEC model comes close to the expert models optimized for individual tasks of PSE and AEC in their respective scenarios and significantly outperforms the expert models for the combined PSE-AEC task.

#25 TaylorBeamixer: Learning Taylor-Inspired All-Neural Multi-Channel Speech Enhancement from Beam-Space Dictionary Perspective

Authors: Andong Li ; Weixin Meng ; Guochen Yu ; Wenzhe Liu ; Xiaodong Li ; Chengshi Zheng

Despite the promising performance of existing frame-wise all-neural beamformers in the speech enhancement field, their underlying mechanism remains unclear. In this paper, we revisit beamforming behavior from the beam-space dictionary perspective and formulate it as the learning and mixing of different beam-space components. Based on that, we propose an all-neural beamformer called TaylorBM that simulates a Taylor series expansion, in which the 0th-order term serves as a spatial filter to conduct the beam mixing, and several higher-order terms are tasked with residual noise cancellation for post-processing. The whole system is devised to work in an end-to-end manner. Experiments are conducted on the spatialized LibriSpeech corpus, and the results show that the proposed approach outperforms existing advanced baselines in terms of evaluation metrics.